Large Language Models for Market Research: A Data-augmentation Approach
Wang, Mengxin, Zhang, Dennis J., Zhang, Heng
Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when one is substituted for the other. In this paper, we address this gap by proposing a novel statistical data-augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. Our method leverages transfer-learning principles to debias the LLM-generated data using a small amount of human data, yielding statistically robust estimators that are consistent and asymptotically normal, in contrast to naive approaches that simply substitute human data with LLM-generated data and can thereby exacerbate bias. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and to save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data because of the inherent biases in LLM-generated data relative to human data. Another empirical study, on sports-car choices, confirms the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.
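The debiasing idea can be illustrated with a toy mean-estimation sketch (not the paper's estimator; all numbers hypothetical): a small paired human sample estimates the systematic LLM bias, which then corrects a large LLM-only sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: theta is the human preference parameter; LLM
# responses carry a systematic bias that a small paired human sample
# can estimate.
theta, bias = 0.60, 0.15
n_paired, n_llm = 400, 10_000

human_paired = rng.normal(theta, 0.5, n_paired)        # human answers
llm_paired = rng.normal(theta + bias, 0.5, n_paired)   # LLM on the same profiles
llm_only = rng.normal(theta + bias, 0.5, n_llm)        # cheap LLM-only sample

naive = llm_only.mean()                                # inherits the bias
bias_hat = (llm_paired - human_paired).mean()          # estimated LLM bias
augmented = llm_only.mean() - bias_hat                 # bias-corrected estimate
```

The naive estimate converges to the wrong value no matter how much LLM data is collected, while the corrected estimate's error is driven by the (small) human sample.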
Cyber-Attack Technique Classification Using Two-Stage Trained Large Language Models
Understanding the attack patterns associated with a cyberattack is crucial for comprehending the attacker's behavior and implementing the right mitigation measures. However, the majority of the information regarding new attacks is presented in unstructured text, posing significant challenges for security analysts collecting the necessary information. In this paper, we present a sentence classification system that can identify the attack techniques described in natural-language sentences from cyber threat intelligence (CTI) reports. We propose a new method for utilizing auxiliary data with the same labels to improve classification for the low-resource cyberattack classification task. The system first trains the model on the augmented training data and then continues training on the primary data alone. We validate our model using the TRAM dataset and the MITRE ATT&CK framework. Experiments show that our method improves Macro-F1 by 5 to 9 percentage points while keeping Micro-F1 competitive with the baseline on the TRAM dataset.
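The two-stage recipe can be sketched on a stand-in model (a plain logistic classifier rather than an LLM; data and shapes are hypothetical): stage one trains on the augmented pool, stage two continues on primary data only.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_step(w, X, y, lr=0.1):
    """One full-batch gradient step on the logistic loss."""
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
    return w - lr * X.T @ (p - y) / len(y)

# Hypothetical data: scarce primary examples plus plentiful auxiliary
# examples with the same label space but a shifted distribution.
d = 5
w_true = rng.normal(size=d)
X_prim = rng.normal(size=(40, d))
y_prim = (X_prim @ w_true > 0).astype(float)
X_aux = rng.normal(0.3, 1.0, size=(1000, d))
y_aux = (X_aux @ w_true > 0).astype(float)

# Stage 1: train on the augmented pool (primary + auxiliary).
w = np.zeros(d)
X1 = np.vstack([X_prim, X_aux])
y1 = np.concatenate([y_prim, y_aux])
for _ in range(300):
    w = grad_step(w, X1, y1)

# Stage 2: continue training on the primary data alone.
for _ in range(300):
    w = grad_step(w, X_prim, y_prim)

acc = float(((X_prim @ w > 0) == (y_prim == 1)).mean())
```

The second stage lets the model specialize back to the primary distribution after the auxiliary data has provided a broad initialization.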
Deep non-parametric logistic model with case-control data and external summary information
Shi, Hengchao, Zheng, Ming, Yu, Wen
The case-control sampling design serves as a pivotal strategy for mitigating the imbalanced structure observed in binary data. We consider the estimation of a non-parametric logistic model with case-control data supplemented by external summary information. Incorporating the external summary information ensures the identifiability of the model. We propose a two-step estimation procedure. In the first step, the external information is used to estimate the marginal case proportion. In the second step, the estimated proportion is used to construct a weighted objective function for parameter training. A deep neural network architecture is employed for functional approximation. We further derive a non-asymptotic error bound for the proposed estimator. From this, the convergence rate is obtained and shown to attain the optimal rate of non-parametric regression estimation. Simulation studies are conducted to validate the theoretical findings, and a real-data example is analyzed for illustration.
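A minimal sketch of the two-step weighting idea, using a parametric logistic model in place of the paper's deep non-parametric one (the case proportion is plugged in directly here, whereas the paper estimates it from external summary information; all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1 (stand-in): take the marginal case proportion as given.
pi = 0.05

# Case-control sample: cases deliberately oversampled to half the sample.
n = 2000
y = np.repeat([1.0, 0.0], n // 2)
x = rng.normal(y, 1.0)                      # hypothetical covariate model
X = np.column_stack([np.ones(n), x])

# Step 2: weight each unit by (population freq) / (sample freq) so the
# weighted logistic objective mimics a random population sample.
wts = np.where(y == 1, pi / 0.5, (1 - pi) / 0.5)

beta = np.zeros(2)
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta -= 0.5 * X.T @ (wts * (p - y)) / n
```

Because cases are rare in the population (pi = 0.05), the weighted fit pulls the intercept strongly negative, which the unweighted case-control fit would miss.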
An end-to-end framework for gene expression classification by integrating a background knowledge graph: application to cancer prognosis prediction
Inoue, Kazuma, Kojima, Ryosuke, Kamada, Mayumi, Okuno, Yasushi
Motivation: Biological data may be separated into primary data, such as gene expression, and secondary data, such as pathways and protein-protein interactions. Methods that use secondary data to enhance the analysis of primary data are promising because secondary data contain background information not present in the primary data. In this study, we propose an end-to-end framework that integrates secondary data into the construction of a classification model for primary data. We applied this framework to cancer prognosis prediction using gene expression data and a biological network. Results: Cross-validation results indicated that our model achieved higher accuracy than a deep neural network model without background biological-network information. Experiments conducted on patient groups by cancer type showed improvements in ROC area under the curve for many groups. For cancer types predicted with high accuracy, visualization and enrichment analysis identified contributing genes and pathways, revealing known biomarkers as well as novel biomarker candidates.
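A minimal sketch of how a background network can inform primary-data features (not the paper's architecture; the 4-gene network and expression values are hypothetical): smooth each patient's expression vector over the interaction graph before classification.

```python
import numpy as np

# Hypothetical interaction network: genes 0-1 interact, genes 2-3 interact.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                               # add self-loops
A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize

x = np.array([2.0, 0.0, 1.0, 1.0])                  # one patient's expression
x_smoothed = A_norm @ x                             # neighborhood-averaged features
print(x_smoothed)                                   # → [1. 1. 1. 1.]
```

Each gene's value is averaged with its interaction partners, so the secondary data (the graph) shapes the representation seen by the downstream classifier.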
The ART of Transfer Learning: An Adaptive and Robust Pipeline
Wang, Boxiang, Wu, Yunan, Ye, Chenglong
Transfer learning is an essential tool for improving the performance of primary tasks by leveraging information from auxiliary data sources. In this work, we propose Adaptive Robust Transfer Learning (ART), a flexible pipeline for performing transfer learning with generic machine learning algorithms. We establish the non-asymptotic learning theory of ART, providing a provable guarantee of adaptive transfer while preventing negative transfer. Additionally, we introduce an ART-integrated aggregating machine that produces a single final model when multiple candidate algorithms are considered. We demonstrate the promising performance of ART through extensive empirical studies on regression, classification, and sparse learning, and we further present a real-data analysis for a mortality study.
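The guard against negative transfer can be illustrated with a crude stand-in (not the ART pipeline itself; the ridge models and data are hypothetical): fit a target-only model and a pooled source-plus-target model, then select by held-out target loss, so a harmful source loses the comparison.

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge(X, y, lam=1.0):
    """Closed-form ridge regression."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Hypothetical task: small target sample, large source sample whose
# coefficients are deliberately far from the target's (negative transfer).
d = 8
beta_tgt = rng.normal(size=d)
beta_src = beta_tgt + rng.normal(0, 2.0, size=d)

X_tgt = rng.normal(size=(60, d))
y_tgt = X_tgt @ beta_tgt + rng.normal(0, 0.1, 60)
X_src = rng.normal(size=(600, d))
y_src = X_src @ beta_src + rng.normal(0, 0.1, 600)

# Hold out part of the target data for validation.
Xtr, ytr, Xva, yva = X_tgt[:40], y_tgt[:40], X_tgt[40:], y_tgt[40:]

candidates = {
    "target-only": ridge(Xtr, ytr),
    "pooled": ridge(np.vstack([Xtr, X_src]), np.concatenate([ytr, y_src])),
}
val_loss = {k: float(np.mean((Xva @ b - yva) ** 2)) for k, b in candidates.items()}
best = min(val_loss, key=val_loss.get)
```

With a dissimilar source, the pooled fit is dragged toward the wrong coefficients and the selection rule falls back to the target-only model.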
AIOps: What, Why, and How? - DZone AI
Since Gartner coined the term AIOps in 2016, artificial intelligence has become a buzzword in the advanced technological world. The goal of AIOps is to automate the resolution of complex IT system issues while simplifying operations. Simply put, AIOps is a transformational approach that uses machine learning and AI technologies to run operations such as event correlation, monitoring, service management, observability, and automation. With AIOps, you can collect and aggregate the ever-increasing data generated by observability and monitoring systems, applications, and infrastructure; filter out the noise to identify events and patterns behind system performance and availability issues; and determine root causes, often resolving them automatically or alerting the IT team. Without AIOps, it becomes difficult to keep up with the rapid pace of technology innovation, and IT operations that depend on traditional knowledge and legacy systems are likely to become unpredictable and unscalable.
Credit card fraud detection - Classifier selection strategy
Machine learning has opened up new tools for financial fraud detection. Using a sample of annotated transactions, a machine learning classification algorithm learns to detect fraud. With growing credit card transaction volumes and rising fraud percentages, there is growing interest in finding appropriate machine learning classifiers for detection. However, fraud data sets are diverse and exhibit inconsistent characteristics; a model effective on one data set is not guaranteed to perform well on another. Further, data patterns and characteristics are prone to temporal drift. Additionally, fraud data exhibits massive and varying class imbalance. In this work, we evaluate sampling methods as a viable pre-processing mechanism for handling imbalance and propose a data-driven classifier selection strategy for highly imbalanced fraud-detection data sets. The model derived from our selection strategy surpasses peer models while operating under more realistic conditions, establishing the effectiveness of the strategy.
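One of the simplest sampling methods for handling imbalance can be sketched as follows (a generic random-oversampling illustration on synthetic data, not the paper's specific pre-processing): replicate minority rows until the classes balance.

```python
import numpy as np

rng = np.random.default_rng(4)

def oversample(X, y):
    """Random oversampling: replicate minority rows until classes balance."""
    idx_min = np.flatnonzero(y == 1)
    idx_maj = np.flatnonzero(y == 0)
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

# Hypothetical fraud-like data: roughly 2% positives.
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.02).astype(int)

Xb, yb = oversample(X, y)
print(yb.mean())   # exactly balanced: 0.5
```

Because the resampling happens before training, any downstream classifier can be plugged in unchanged, which is what makes sampling attractive as a pre-processing mechanism.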
6 Reasons to Spend More Time Thinking About Labels
Quite a few of these issues should be addressed as part of an established machine learning operations practice. Some may be resolved through support functions such as legal, people, general data management, and smart procedure design; more on that in a later post. For now, let's focus on the all-important labels, as opposed to the features.
Why AI Is Failing for Enterprises: Predetermination Bias
Our best practices for managing data are ancient. For tens of thousands of years, we've managed the future by predetermining the resources we think we'll need, limiting our futures to what we can foresee. AI, and especially machine learning, can learn from seemingly insignificant data, often to our human delight and surprise. So it's time to rethink our processes and mindset when it comes to data: we need to stop limiting data to the futures we expect.
Lightweight Data Fusion with Conjugate Mappings
Dean, Christopher L., Lee, Stephen J., Pacheco, Jason, Fisher, John W. III
We present an approach to data fusion that combines the interpretability of structured probabilistic graphical models with the flexibility of neural networks. The proposed method, lightweight data fusion (LDF), emphasizes posterior analysis over latent variables using two types of information: primary data, which are well-characterized but with limited availability, and auxiliary data, readily available but lacking a well-characterized statistical relationship to the latent quantity of interest. The lack of a forward model for the auxiliary data precludes the use of standard data fusion approaches, while the inability to acquire latent variable observations severely limits direct application of most supervised learning methods. LDF addresses these issues by utilizing neural networks as conjugate mappings of the auxiliary data: nonlinear transformations into sufficient statistics with respect to the latent variables. This facilitates efficient inference by preserving the conjugacy properties of the primary data and leads to compact representations of the latent variable posterior distributions. We demonstrate the LDF methodology on two challenging inference problems: (1) learning electrification rates in Rwanda from satellite imagery, high-level grid infrastructure, and other sources; and (2) inferring county-level homicide rates in the USA by integrating socio-economic data using a mixture model of multiple conjugate mappings.
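The conjugate-mapping idea can be sketched in a Beta-Bernoulli toy instance (not the paper's models; the mapping below is a fixed stand-in for the learned network, and all numbers are hypothetical): primary data update the Beta posterior exactly, while auxiliary data are mapped to Beta pseudo-counts so the posterior stays in closed form.

```python
import numpy as np

rng = np.random.default_rng(5)

theta = 0.3                              # latent rate of interest
primary = rng.random(40) < theta         # scarce, well-characterized draws

def conjugate_map(aux, strength=10.0):
    """Stand-in mapping: auxiliary features -> Beta sufficient statistics
    (pseudo-successes, pseudo-failures)."""
    s = 1.0 / (1.0 + np.exp(-aux.mean()))        # squashed summary in (0, 1)
    return strength * s, strength * (1.0 - s)

aux = rng.normal(-0.8, 0.2, size=50)     # auxiliary signal, no forward model

a, b = 1.0, 1.0                          # Beta(1, 1) prior
da, db = conjugate_map(aux)
a += da + primary.sum()                  # conjugate update stays closed-form
b += db + (~primary).sum()
post_mean = a / (a + b)
```

Because the mapping emits sufficient statistics of the conjugate family, the auxiliary data never needs a forward model: inference remains a closed-form Beta update rather than a generic approximate-inference problem.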